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Abstract 

We propose a generic design for Chinese remainder algorithms. A Chi- 
nese remainder computation consists in reconstructing an integer value 
from its residues modulo non coprime integers. We also propose an ef- 
ficient linear data structure, a radix ladder, for the intermediate storage 
and computations. Our design is structured into three main modules: a 
black box residue computation in charge of computing each residue; a 
Chinese remaindering controller in charge of launching the computation 
and of the termination decision; an integer builder in charge of the re- 
construction computation. We then show that this design enables many 
different forms of Chinese remaindering (e.g. deterministic, early termi- 
nated, distributed, etc.), easy comparisons between these forms and e.g. 
user-transparent parallelism at different parallel grains. 



1 Introduction 

Modular methods are largely used in computer algebra to reduce the cost of co- 
efficient growth of the integer, rational or polynomial coefficients. Then Chinese 
remaindering (or interpolation) can be used to recover the large coefficient from 
their modular evaluations by reconstructing an integer value from its residues 
modulo non coprime integers. 

LiNBO5(0[n] is an exact linear algebra library providing some of the most 
efficient methods for linear systems over arbitrary precision integers. For in- 
stance, to compute the determinant of a large dense matrix over the integers 
one can use linear algebra over word size finite fields ,101 and then use a combi- 
nation of system solving and Chinese remaindering to lift the result [T3j ■ The 
Frobenius normal form of a matrix is used to test two matrices for similarity. 
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Although the Frobenius normal form contains more in formation on the matrix 
than the characteristic polynomial, most efficient algorithms to compute it are 
based on computations of characteristic polynomial (see for example [2^]). Now 
the Smith normal form of an integer matrix is useful e.g. in the computation 
of homology groups and its computation can be done via the integer minimal 
polynomial [H]. In both cases, the polynomials are computed first modulo sev- 
eral prime numbers and then only reconstructed via Chinese remaindering using 
precise bounds on the integer coefficients of the integer characteristic or minimal 
polynomials [5]. 

An alternative to the deterministic remaindering is to terminate the recon- 
struction early when the actual integer result is smaller than the estimated 
bound [Ml [121 [Hj • There after the reconstruction stabilizes for some modular 
iterations, the computation is stopped and gives the correct answer with high 
probability. 

In this paper we propose first in section [2] a linear space data structure 
enabling fast computation of Chinese reconstruction. Then we propose in sec- 
tion [3] to structure the design of a generic pattern of Chinese remaindering into 
three main modules: a black box residue computation in charge of comput- 
ing each residue; a Chinese remaindering controller in charge of launching the 
computation and of the termination decision; an integer builder in charge of 
the reconstruction computation. We show in section H] that this design enables 
many different forms of Chinese remaindering (e.g. deterministic, early termi- 
nated, distributed, etc.) and easy comparisons between these forms. We show 
then in section [5] that this structure provides also an easy and efficient way to 
provide user-transparent parallelism at different parallel grains. Any parallel 
paradigm can be implemented provided that it fulfills the defined controller in- 
terface. We here chose to use KAApj^[lB] to show the efficiency of our approach 
on distributed/ shared architectures. 

2 Radix ladder: linear structure for fast Chinese 
remaindering 

2.1 Generic reconstruction 

We are given a black box function which computes the evaluation of an integer 
R modulo any number m (often a prime number). 

To reconstruct R, we must have enough evaluations rj = R mod rrij modulo 
coprimes rrij. To perform this reconstruction, we need two by two liftings with 
U = R mod M and V = R mod N as follows: 

Rmn = U+iV -U)x {AT^ mod N) x M. (1) 

We will need this combination most frequently in two different settings: when 
M and N have the same size, and when TV is of size 1. The first generic aspect 
of our development is that for both cases, the same implementation can be fast. 

" ^http77/kaapl .gf orge . inrla.f r | 
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We first need a complexity model. We do not give much details on fast 
integer arithmetic in this paper, instead our point is to show the genericity of our 
approach and that it facilitates experiments in order to obtain goods practical 
efficiency with any underlying arithmetic. Therefore we propose to use a very 
simplified model of complexity where division/inverse/modulo are slightly slower 
than multiplication. We denote by rual" the complexity of integer multiplication 
of size / with 1 < a < 2, and ranging from 21'^ for classical multiplication to 
0{l^~^'^) for FFT-like algorithms and by dal"" the complexity of division, ranging 
also from 3P to 0(1'^+^). We refer to e.g. the GMP manual! or [THIIIS] for more 
accurate estimates. 



Size of operands 


Mul. 


Div. 


CRT 


/ X 1 
I X I 


I 

Trial"' 


2,1 


8? + 0(l) 
2(m„ + d„)P + O(0 



Table 1: Integer arithmetic complexity model 



With this in mind we compute formula ([T]) with one multiplication modulo 
as follows: 



Algorithm 1 Reconstruct 
Input: U = R mod M and V = R 
Output: Rmn = R mod M x N. 

1: Un^V -U mod iV; 

2; Mn = M-^ mod TV; 

3: Un = Un X Mm mod N; 

4: Rmn = U + Un -x M; 

5: if Rain > M x N then Rmn - 



mod N. 



Rmn - M X N end if 



Now, if the formula ([ij is computed via algorithm [T] and the operation 
counts is done using column "Mul." for multiplication and "Div." for divi- 
sion/inverse/modulo, then we have the complexities given in column "CRT" of 
table m 

2.2 Radix ladder 

Fast algorithms for Chinese remaindering rely on reconstructing pairs of residues 
of the same size. A usual way of implementing this is via a binary tree struc- 
ture (see e.g. figure [T] left). But Chinese remaindering is usually an iterative 
procedure and residues are added one after the other. Therefore it is possible to 
start combining them two by two before the end of the iterations. Furthermore, 
when a combination has been made it contains all the information of its leaves. 
Thus it is sufficient to store only the partially recombined parts and cut its 
descending branches. We propose to use a radix ladder for that task. A radix 

'' http: //gmplib . org/gmp-man-4 . 3 . .pdf | 
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ladder is a ladder composed of successive shelves. A shelf is either empty or 
contains a modulo and an associated residue, denoted respectively Mi and Ui at 
level i. Moreover, at level /, are stored only residues or moduli of size 2*. New 
pairs of residue and modulo can be inserted anywhere in the ladder. If the shelf 
corresponding to its size is empty, then the pair is just stored there, otherwise 
it is combined with occupant of the shelf, the latter is dismissed and the new 
combination tries to go one level up as shown on algorithm [2] 



Algorithm 2 RADixLADDER.insert(t/, Af) 
Input: U = R mod M and a Radix ladder 
Output: Insertion of U and M in the ladder., 

1: for i = size{M) while Shelf[«] is not empty do 

2: C/, M:=Reconstruct([/ jnodM. Ui mod il/j); 

3: Pop Shelf H; 

4: Increment z; 

5: end for 

6: Push U,M in Shelf [i]; 



Then if the new level is empty the combination is stored there, otherwise it 
is combined and goes up ... An example of this procedure is given on figure [T] 




Figure 1: A residue going up the radix ladder 



Then to recover the whole reconstructed number it is sufficient to iterate 
through the ladder from the ground level and make all the encountered par- 
tial results go to up one level after the other to the top of the ladder. As 
we will see in section 13. 3[ LinBox-1.1.7 contains such a data structure, in 
[linbox/ algorit hms/ cra-full-multip .h . 

An advantage of this structure is that it enables insertion of any size pair with 
fast arithmetic complexity. Moreover, merge of two ladders is straightforward 
and we will make an extensive use of that fact in a parallel setting in section [5l 
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Algorithm 3 RadixLadder. merge 
Input: Two radix ladders RLi and RL2- 
Output: In place merge of RLi and RL2. 

1: for i = to size(i?L2) do 

2: i?Li.insert(i?L2.Shelf[i]); 

3: end for 

4: Return RLi 



3 A Chinese remaindering design pattern 

The generic design we propose here conies from the observation that there are in 
general two ways of computing a reconstruction: a deterministic way computing 
all the residues until the product of moduli reaches a bound on the size of the 
result ; or a probabilistic way using early termination. We thus propose an 
abstraction of the reconstruction process in three layers: a black box function 
produces residues modulo small moduli, an integer builder produces reconstruc- 
tions using algorithm [21 and a Chinese remaindering controller commands them 
both. 

Here our point is that the controller is completely generic where the builder 
may use e.g. the radix ladder data structure proposed in section [2] and has to 
implement the termination strategy. 

3.1 Black box residue computation 

In general this consists in mapping the problem from Z to Z/ttjZ and com- 
puting the result modulo m. Such black boxes are defined e.g. for the de- 
terminant, valence, minpoly, charpoly, linear system solve as function objects 
[integerModular* (where * is one of the latter functions) in the linbox/ solutions] 
directory of LinBox-1.1.7. 

3.2 Chinese remaindering controller 

The pattern we propose here is generic with respect to the termination strategy 
and the integer reconstruction scheme. The controller must be able to initialize 
the data structure via the builder ; generate some coprime moduli ; apply the 
black box function ; update the data structure ; test for termination and output 
the reconstructed element. The generations of moduli and the black box are 
parameters and the other functionalities are provided by any builder. Then the 
control is a simple loop. Algorithm 0] shows this loop which contains also the 
whole interface of the Builder. 

LinBox-1.1.7 gives an implementation of such a controller, parametrized by 
a builder and a black box function as the class ChineseRemainder in linbox/algo rithms/cra-domaiii.h[ 

The interface of a controller is to be a function class. It contains a constructor 
with a builder as argument and the functional operator taking as argument a 
BlackBox, computing e.g. a determinant modulo to, and a moduli generator and 
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Algorithm 4 CRA-Control 

1: Builder .initialize{)-, 

2: while Builder .notTerminatedQ do 
3: p :— _Bm/der.nextCoPrime(); 
4: V :— BlackBox .apply {p); 
5: Builder. upda.te{v,p); 

6: end while 

7: Return i?Mi/(ier.reconstruct(); 



returning an integer reconstructed from the modular computations. Algorithm 
[S] shows the specifications of the LinBox-1.1.7 controller. Then any higher- 



Algorithm 5 C++ ChineseRemainder class 
template<class Builder> struct ChineseRemainder { 
ChineseRemainder (const Builderfe b) : builder_(b) {} 
template<class Function> Integerfe operator () ( 

Integer & res, 

const Function & BlackBox) { 
// CRA-Control . . . 

> 

const Builderfe getBuilderO { return builder_; } 
protected: Builder builder_; 

>; 



level algorithm will just chose its builder and its controller and pass them the 
modular BlackBox iteration it wants to lift over the integers. 

3.3 Integer builders 

The role of the builder is to implement the interface defined by algorithm [H 

There are already three of these implementations in LinBox-1.1.7: an early 
terminated for a single residue, an early terminated for a vector of residues 
and a deterministic for a vector of residues (resp. the files |cra -early- single . hj 
|cra- early- multip.h and |cra-f ull-multip .h in the |linbox/a lgorithiiis direc- 
tory). Up to now the radix ladder is not a separated class as only this data 
structure is currently used and as it is simple enough to inherit from one of the 
latter and modify the behavior of the methods. 

Actually |,EarlyMultipCRA inherits from both EarlySingleCRA and F ullMult ipCRA| 
as it uses the radix ladder of FullMultipCRA for its reconstruction and the early 
termination of EarlySingleCRA to test a linear combination a the residues to 
be reconstructed as shown on figure [2] 

We give more implementation details on the early termination strategies in 
sections S] and [5] 



6 



















O 


O 



GD 



O 



Linear 

combination 



w 


1 







Fulll\/lultip 



EarlySingle 



Earlyl\/lultip 

Figure 2: Early termination of a vector of residues via a linear combination 



4 Termination strategies 

Let us sketch here several early termination strategies and show that our design 
enables to modify this strategy and only that while the rest of the implementa- 
tion is unchanged. 

4.1 Earliest termination 

In a sequential mode, depending on the actual speed of the different routines 
of table [Hon a specific architecture or if the cost of BlackBox. apply is largely 
dominant, one can choose to test for termination after each call to the black 
box. A way to implement the probabilistic test of [12, Lemma 3.1] and to reuse 
every black box apply is to use random primes as the moduli generator. Indeed 
then the probabilistic check can be made with the incoming black box residue 
computed modulo a random prime. The reconstruction algorithm of section [3] 
is then only slightly modified as shown in algorithm [5] and the termination test 
becomes simply algorithm [T] 

In the latter algorithm, EarlyT erminationThreshold is the number of suc- 
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Algorithm 6 EARLYSiNGLECRA.update(i;,p) 
Global: U = R mod M. 

Global: A variable Stabilization initially set to 0. 

Input: V = R mod p. 

Output: Rmn = R mod M x p. 

1: u = U mod p; 

2: if u == V then 

3: Increment Stabilization; 

4: Return ([/,M X p); 

5: else 

6: Stabilization = 0; 

7: Return Reconstruct(C/ modAf, i; modp); 
8: end if 



Algorithm 7 EARLYSiNGLECRA.notTerminated() 

1: Return Stabilization < EarlyT ervninationThreshold; 



cessive stabilizations required to get a probabilistic estimate of failures. It will 
be denoted ET for the rest of the paper. This is the strategy implemented in 
LinBOX-1.1.7 in'linbox/algorithms/cra-early- single. h'. With the estimates of 
table [U the cost of the whole reconstruction of algorithm |4] thus becomes 

t 

(apply + 8^ + 0(1)) = 

i=l 

[t + i;r)apply + A{t + ETf + 0{t) (2) 

where t = [logj^ (i?)] and /? is the word size. 

This strategy enables the least possible number of calls to BlackBox .scp'ply . 
It it thus useful when the latter dominates the cost of the reconstruction. 

4.2 Balanced termination 

Another classic case is when one wants to use fast integer arithmetic for the 
reconstruction. Then the balanced computations are mandatory and the radix 
ladder becomes handy. The problem now becomes the early termination. There 
a simple strategy could be to test for termination only when the number of 
computed residues is a power of two. In that case the reconstruction is guaran- 
teed to be balanced and fast Chinese remaindering is also guaranteed. Moreover 
random moduli are not any more necessary for all the residues, only those test- 
ing for early termination need be randomly generated. This induces another 
saving if one fixes the other primes and precomputes all the factors Mi x (M~^ 
mod Mi+i). There the cost of the reconstruction drops by a factor of 2 from 
2(ma + da)l°' to (ma + da)l°'. 
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The drawback is an extension of the number of black box applications from 
\log2f> (i?)] + ET to the largest power of two immediately superior and thus up 
to a factor of 2 in the number of black box applies. 

For the Builder, the update becomes just a push in the ladder as shown on 
algorithm [5] 



Algorithm 8 EARLYBALANCEDCRA.update(w,p) 
1: RADixLADDER.insert(w,p); 



The termination condition, on the contrary tests only when the number of 
residues is power of two as shown on algorithm |9l 

Algorithm 9 EARLYBALANCEDCRA.notTerminated() 

1: if Only one Shelf, Shelf [i], is full then 

2: Set Ui to Shelf[i] residue; 

3: for J — 1 to EarlyTerminationThreshold do 

4: p :=PrimeGenerator(); 

5: if {Ui mod p) ! = B lack Box. apply {p) then 

6: Return false; 

7: end if 

8: end for 

9: Return true; 

10: else 

11: Return false; 

12: end if 



Then, the whole reconstruction of algorithm 2] now requires: 

ET ■ (apply + 3 • 2^) + ^ ^ (apply + (to„ + d„)2'") 

+ (apply + 3-2*)= 
(2*= + k + ET-l)- apply + (2'=)" I^±|i + 0{2^) 

operations, where now k — [log2(log23 (i?))] . 

Despite the augmentation in the number of black box applications, the latter 
can be useful, in particular when multiple values are to be reconstructed. 

Example 1. Consider the Gaufiian elimination of an integer matrix where 
all the matrix entries are larger than n and bounded in absolute value by Aoo- 
We denote \og2fi{Aoo) by a^o- Suppose one would like to compute the rational 
coefficients of the triangular decomposition only by Chinese remaindering ( there 
exist better output dependant algorithms, see e.g. \21f . but the latter has the 
same worst-case complexity). Now, Hadamard bound gives that the resulting 
numerators and denominators of the coefficients are bounded by y^nA^ . Then 
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the complexity of the earliest strategy would be dominated by the reconstruction 
where the balanced strategy or the hybrid strategy of figure [H could benefit from 
fast algorithms: 



Table 2: Early termination strategies complexities for Chinese remaindered 
GauBian elimination with rationals 

In the case of small matrices with large entries the reconstruction dominates 
and then a balanced strategy is preferable. Now if both complexities are com- 
parable it might be useful to reduce the factor of 2 overhead in the black box 
applications. This can be done via amortized techniques, as shown next. 

4.3 Amortized termination 

A possibility is to use the p-amortized control of [5]: instead of testing for 
termination at steps 2^, 2^, . . ., 2', ... the tests are performed at steps 
yo3(^\ . . ., . . . with 1 < p < 2 and / satisfies V«, g{i) < i. If the complexity 

of the modular problem is C and the number of iterations to get the output is 
b, [2] give choices for p and g which enable to get the result with only b + 
iterations and extra 0{f{b)) termination tests where f{b) = logpib). 

In example [T] the complexity of the modular problem is , the size of the 
output and the number of iterations is naoo so that strategy would reduce 
the iteration complexity from 2n"^^aoo to {naoo + o{naoo))'n'^ and the overall 
complexity would then become: 





+ log{naoc)n"a'^) 


|EarlyAmortizedCRA| 



Indeed, we suppose that the amortized technique is used only on a linear 
combination, and that the whole matrix is reconstructed with a |FullMult ipCRA[ 
as in figure [2] Then the linear combination has size 2 log(n) + n • Uoc which is 
still 0{n ■ Qoo). Nonetheless, there is an overhead of a factor log(naoo) in the 
linear combination reconstruction since there might be up to O(log(naoo)) values 
p3M^ , _ , between any two powers of two. Overall this gives the above 

estimate. Now one could use other g functions as long as eq. |3]is satisfied. 



5 Parallelization 

All parallel versions of these sequential algorithms have to consider the parallel 
merge of radix ladders and the parallelization of the loop of the CRA-control 



EarlySingleCRA" I 0(n''a^) 

larlyMultipCRA ©(n^+^aoo + n^+^a^^ + ri^a^) 
EarlyBalancedCRA O(2n"+^aoo + n^+"Q^) 





10 



algorithm Many parallel libraries can be used, namely OpenMP or Cilk 
would be good candidates for the parallelization of the embarrassingly parallel 
[FullMultipCRA, Now in the early termination setting, the main difficulty comes 
from the distribution of the termination test. Indeed, the latter depends on data 
computed during the iterations. To handle this issue we propose an adaptive 
parallel algorithm [51 and use the Kaapi library [BJ [T^ . Its expressiveness 
in an adaptive setting guided our choice, together with the possibility to work 
on heterogenous networks. 

5.1 Kaapi overview 

Kaapi is a task based model for parallel computing. It was targeted for dis- 
tributed and shared memory computers. The scheduling algorithm uses work- 
stealing [3J [U 131 [Hj : an idle processor tries to steal work to a randomly selected 
victim processor. 

The sequential execution of a Kaapi program consists in pushing and pop- 
ping tasks to dequeue the current running processor. Tasks should declare the 
way they access the memory, in order to compute, at runtime, the data flow 
dependencies and the ready tasks (when all their input values are produced). 
During a parallel execution, a ready task, in the queue but not executed, may 
be entirely theft and executed on an other processor (possibly after being com- 
municated through the network). These tasks are called dfg tasks and their 
schedule by work-stealing is described in [THl [13 ■ 

A task being executed by a processor may be only partially theft if it interacts 
with the scheduler, in order to e.g. decide which part of the work is to be given 
to the thieves. Such tasks are called adaptive tasks and allows fine grain loop 
parallelism. 

To program an adaptive algorithm with Kaapi, the programmer has to spec- 
ify some points in the code (using kaapi_stealpoint) or sections of the code 
( |kaapi_stealbegin[ |kaapi_stealendp where thieves may steal work. To guar- 
antee that parallel computation is completed, the programmer has to wait for 
the finalization of the parallel execution (using kaapi_steal_f inalize). More- 
over, in order to better balance the work load, the programmer may also decide 
to preempt the thieves (send an event via k aapi_preempt_nextp . 

5.2 Parallel earliest termination 
Algorithm 1101 lets thieves steal any sequence of primes. 

At line[31 the code allows the scheduler to trigger the processing of steal requests 
by calling the splitter function. The parameters of kaapi_stealbegin are the 
splitter function and some arguments to be given to its call. These argumentfQ 
can e.g. specify the state of the computation to modify (here the builder object 
plays this role). Then, on the one hand, concurrent modifications of the state of 
computation by thieves, must be taken care of during the control fiow between 

^in or out 
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Algorithm 10 ParallelCRA-Control 

1: Builder .initialize{)-, 

2: while Builder .notTerminatedO do 

3: p := BuiWer.nextCoPrimeO; 

4: kaapi_stealbegin( splitter, Builder); 

5: V B I ack Box. apply (p); 

6: Builder. update{v,p); 

7: kaapi_finalize_steal(); 

8: kaapi_stealend(); 

9: if require synchronization step then 

10: while kaapi_nomore_thief do 

11: {list of v,list of p) :=kaapi_preempt_next(); 

12: Builder .\ipdate{list of v, list of p); 

13: end while 

14: end if 

15: end while 

16: Return Bitj/der.reconstructO; 



lines ID and [8j here the computation of the residue could be evaluated by multiple 
threads without critical sectiorH. On the other hand, after line [8j the scheduler 
guarantees that no concurrent thief can modify the computational state when 
they steal some work. Remark that both branches of the conditional If at linelS] 
must be executed without concurrency: the iteration of the list of thieves or the 
generation of the next random modulo are not reentrant. 

The role of the splitter function is to distribute the work among the thieves. 
In algorithmllll each thief receives a coPrimeGenerator object and the entrypoint 
to execute. 



Algorithm 11 SPLmER{Builder, N,requests[]) 
1: for i = to iV - 1 do 

2: kaapi_request_reply(regMesf[i], entrypoint, 
Builder. getCoPrimeGenerator{) ); 

3: end for 



The [coPrimeGenerator depends on the IBuilderi type and allows the thief 
to generate a sequence of moduli. For instance the |coPr imeGener atorl for the 
earliest termination contains at one point a single modulo M which is returned 
by the next call of nextCoPrimeO by the Builder, 

The splitter function knows the number N of thieves that are trying to steal 
work to the same victim. Therefore it allows for a better balance of the work 
load. This feature is unique to Kaapi when compared to other tools having a 
work-stealing scheduler. 

^This depends on the implementation, most of the LinBox library functions are reentrant 
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5.3 Synchronization 

Now, the victim periodically tests the global termination of the computation 
(linel^of algorithm fTU)) . Depending on the chosen termination method (Early*CRA| 
etc.), the synchronization may occur at every iteration or after a certain num- 
ber of iterations. The choice is made in order to e.g. amortize the cost of this 
synchronization or reduce the arithmetic cost of the reconstruction. Then each 
thief is preempted (line [TT|) and the code recovers its results before giving them 
to the iBufrderl for future reconstruction (line [T^ . 

The preemption operation is a two way communication between a victim and 
a thief: the victim may pass parameters and get data from one thief. Note that 
the preemption operation assumes cooperation with the thief code. The latter 
being responsible for polling incoming events at specific points (e.g. where the 
computational state is safe preemption- wise) . 

On the one hand, to amortize the cost of this synchronization, more primes 
should be given to the thieves. In the same way, the victim code works on a 
list of moduli inside the critical section (at line [3] returns a list of moduli, and 
at lines [5]|6] the victim iterates over this list by repeatedly calling apply and 
^pdate methods). On the other hand, to avoid long waits of the victim during 
preemption, each thief should test if it has been preempted to return quickly its 
results (see next section). 

5.4 Thief entrypoint 

Finally, algorithm [T^ returns both the sequence of residues and the sequence 
of primes that where given to the BlackBox. This algorithm is very similar to 
algorithm [TOl 



Algorithm 12 Thief's EntryPoint(M) 



1: 


BuzWer.initializeO; 


2; 


list of w.clearO; 


3: 


list of p.clear(); 


4: 


while _BuiZ(ier.CoPrimeGenerator() not empty do 


5: 


if kaapi_preemptpoint() then break; end if 


6: 


p SuiWer.nextCoPrimeO; 


7: 


kaapi_stealbegin( splitter, Builder); 


8: 


list of p.push_back(p); 


9: 


list of t;.push_back(_BZacfci3oa;.apply(p)); 


10: 


kaapi_stealend() ; 


11: 


end while 


12: 


kaapi_stealreturn {list of v, list of p); 



Lines [7] and [10] define a section of code that could be concurrent with steal 
requests. At lineO the code tests if a preemption request has been posted by 
algorithm [10] at line [TTJ If this is the case, then the thief aborts any further 
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computation and the result is only a partial set of the initial work allocated by 
the splitter function. 

5.5 Efficiency 

These parallel versions of the Chinese remaindering have been implemented 
using Kaapi transparently from the LinBox library: one has just to change 
the sequential controller era- domain. h to the parallel one. 

In LinBox-1.1.7 some of the sequential algorithms which make use of some 
Chinese remaindering are the determinant, the minimal/characteristic polyno- 
mial and the valence, see e.g. [1^1 [121 [H] [5] for more details. 

We have performed these preliminary experiments on an 8 dual core machine 
(Opteron 875, 1MB L2 cache, 2.2Ghz, with SOGBytes of main memory). Each 
processor is attached to a memory bank and communicates to its neighbors via 
an hypertransport network. We used g++ 4.3.4 as CH — h compiler and the Linux 
kernel was the 2.6.32 Debian distribution. 

All timings are in seconds. In the following, we denote by Tseq the time of 
the sequential execution and by Tp the time of the parallel execution for p = 8 
or p = 16 cores. All the matrices are from "Sparse Integer Matrix Collection" 
(SIMCB 

Table[3]gives the performance of the parallel computation of the determinant 
for small invertible matrices (less than a second) and larger ones (an hour CPU) 
of the .SIMC/SPG and SIMC/Tref ethen collections. 



Matrix 


d,r 


Tseq [k] 


Tp=8[k] 


Tp=i6[k] 


ex — 1 
ex — 3 


560, 8736 
2600, 71760 


0.336[4] 
914.28[183] 


0.386[60] 
247. 75 [303] 


0.533[124] 
102.75[263] 


150 
t-300 
t~500 
t-700 
t - 2000 


150, 2040 
300, 4678 
500, 8478 
700, 12654 
2000, 41907 


0.26[59] 
2.90[138] 
17.65[249] 
58.25[367] 
3208.59[1274] 


0.14[135] 
0.88[143] 
3.05[319] 
11. 45 [386] 
708.506[1295] 


0.13[263] 
0.48[255] 
1.57[255] 
7.56[376] 
434.31 [1286] 



Table 3: Timings for the computation of the determinant, d is the dimension 
of the matrix, r the number of non-zero coefficients, [k] is the minimal number 
of primes observed for the Chinese remaindering using p cores. 

The small instance (ex-1) needed very few primes to reconstruct integer the 
solution. There, we can see the overhead of parallelism: this is due to some 
extra synchronizations and also to the large number of unnecessary modular 
computations before realizing that early termination was need. Despite this 
(124 modular computations with 16 cores against 4 modular applications in 
sequential) the overhead was less than 0.2 second. Now, for matrix ex-3, the 
speed up for 16 cores was a little bit more than 8: with the large amount of 

' ^http : //I j k ■ lmag7fr7CA SYS/SIHC] 
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memory, we assume that the memory bandwidth is the bottleneck when two 
cores on the same processor share the memory interconnection. 



Matrix 


d 




Tp=4[k] 


Tp=8[k] 


chSS.b5 


564480 X 376320 


73.78[5] 


40.53[6] 


47. 58 [5] 


n4c5.65 
n4c5M 
n4c5.67 
n4c6.65 


4340 X 2852 
4735 X 4340 
3635 X 4735 
51813 X 20058 


5.62[16] 
25.15[32] 
29.87[38] 
92.29[18] 


2.09[17] 
10.07[33] 
12.17[42] 
34. 16 [22] 


3.09[35] 
9.12[35] 
9.07[40] 
35.45[29] 



Table 4: Times for the computation of the valence, d is the dimension of the ma- 
trix, [k] is the minimal number of primes observed for the Chinese remaindering 
using p cores. 

Table S] shows the performance of the parallel computation of the valence of 
A* A. Matrices in this table come from the SIMC/Homology collection and were 
used in [T^], where the parallelism was ad- hoc. 

Further analysis is required to identify hot spots in the executions in or- 
der to improve the efficiency at this computational grain. First at all, we 
need to increase the size of the critical section between |kaapi_stealbegiii| 
pjaapi_stealend: e.g. by generating in parallel random primes. We already 
found some bottlenecks due to memory allocation that may be removed. Be- 
sides, it remains difficult on such architecture to control the computational 
affinity with the mapping of data in memory. 

6 Conclusion 

We have proposed a new data structure, the radix ladder, capable of managing 
several kinds of Chinese reconstructions. 

Then, we have defined a new generic design for Chinese remaindering schemes. 
It is summarized on figure [31 Its main feature is the definition of a builder 
interface in charge of the reconstruction. This interface is such that any of ter- 
mination (deterministic, early terminated, distributed, etc.) can be handled by 
a CRA controller. It enables to define and test remaindering strategies while 
being transparent to the higher level routines. Indeed we show that the Chinese 
remaindering can just be a plug-in in any integer computation. 

We also provide in LinBox-1.1.7 an implementation of the ladder, several 
implementations for different builders and a sequential controller. Then we 
tested the introduction of a parallel controller, written with Kaapi, without 
any modification of the LinBox library. The latter handles the difficult issue of 
distributed early termination and shows good performance on a SMP machine. 

In parallel, some improvement could be made to the early termination strat- 
egy in particular when the BlackBox is fast compared to the reconstruction and 
when balanced and amortized techniques are required. Also, output sensitive 
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BlackBoxes 
Determinant 
Minpoly ^ 
Valence Ji" 



Controllers 



CRA- 


-Control 


Parallel-CRA 




Splitter 




ThiefEntryPoint 



Builders 



Early * CRA 
FullMultlpCRA 



RaddixLadder 



Figure 3: Generic Chinese remaindering scheme 



early termination is very useful for rational reconstruction, see e.g. 20^ and 
thus the latter should benefit from this kind of design. 
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